Introduction to Statistical Inference

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

Learning Outcomes

  • Understand the concept of a sampling distribution
  • Quantify the uncertainty of a statistic from a single sample
  • Construct and interpret a confidence interval for the population mean

Sampling distribution of a statistic

An animation to start with…

Figure: A simple random sample of 10 people (out of 40) where we measured an “unknown” numeric value for each person. We then calculated the sample mean, \(\bar{x}\). The population mean \(\mu\)—the parameter—was 175 units.

T021: Sampling error revisited

The concept of a sampling error recognises that there is sample-to-sample—or experiment-to-experiment—variation whenever we summarise data using the “tools” described in T01

A convenient feature of random samples and random assignment is that we can characterise the distribution of a statistic—such as a sample mean, \(\bar{x}\), or a T-test statistic, \(t_0\)—in a way that accounts for sampling error

The survey literature makes a distinction between sampling errors, which arise from the decision to take a sample rather than trying to survey the whole population (which is what a census tries to do)

— Wild & Seber (2000)

Example: Utah’s national parks

We know that Utah (US state) has five national parks, whose population mean area is 261.8 square miles (sq. miles).

What are the possible sample mean areas we could observe, if we took a simple random sample of two national parks?

Figure: The possible sample mean areas if we took a simple random sample of two national parks.

The total area of the five parks is 1309 sq. miles. The ten possible sample means (sq. miles) are: 87.5, 323, 248.5, 174, 291.5, 217, 142.5, 452.5, 378, 303.5
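This enumeration can be reproduced in R with combn(). The whole-number areas below are inferred from the pairwise means in the figure, so treat them as approximations of the parks’ true areas rather than official figures:

```r
# Approximate areas (sq. miles) of Utah's five national parks -- whole-number
# values consistent with the pairwise means shown above, not official figures
park.areas <- c(Arches = 119, Bryce.Canyon = 56, Canyonlands = 527,
                Capitol.Reef = 378, Zion = 229)

# The population mean area of the five parks
mean(park.areas)
# [1] 261.8

# Every possible simple random sample of two parks, and its sample mean
combn(park.areas, 2, FUN = mean)
# [1]  87.5 323.0 248.5 174.0 291.5 217.0 142.5 452.5 378.0 303.5
```

With only \(\binom{5}{2} = 10\) possible samples, we can list the entire sampling distribution exactly; for realistic populations we instead rely on simulation or theory.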

Definition: Sampling distribution of a statistic

A sampling distribution is the distribution of sample statistics computed for different samples of the same size from the same population (or process).

A sampling distribution shows us how the sample statistic varies from sample to sample.

— Lock et al. (2021)

Chocs and Blocks simulation study

T011: Shape revisited

The distribution of the possible sample means when \(n = 10\), under simulation, is unimodal (one peak) and symmetrical

What about the shape?

We call this particular shape a bell-shaped distribution, and we can also describe the sampling distribution of the sample mean as approximately Normally distributed.

Figure: The sampling distribution of \(\bar{x}\) when \(n = 10\)

Normally distributed?

The phrase Normally distributed means that we could use the data to build a model to explain the chance of observing an interval of values

The mathematical details relevant for us in DATAX121 are that:

  • The Normal distribution can be defined with a population mean, \(\mu\), and population standard deviation, \(\sigma\)
  • It’s a good model for some sampling distributions of statistics, and it tends to “fit” the data better as the sample size increases because of the Central Limit Theorem

Figure: The sampling distribution of \(\bar{x}\) when \(n = 10\)

Normal Distribution “fit” for Chocs and Blocks

  • The first Normal distribution curve has \(\bar{x} = 33.47\) and \(s = 6.21\), from 5000 simulated (simple) random samples of 10 blocks
  • The second Normal distribution curve has \(\bar{x} = 33.44\) and \(s = 3.53\), from 5000 simulated (simple) random samples of 25 blocks
  • The third Normal distribution curve has \(\bar{x} = 33.55\) and \(s = 2.03\), from 5000 simulated (simple) random samples of 50 blocks
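The simulation behind these curves can be sketched in R. Since the actual population of block weights is not reproduced in these notes, the sketch below draws a synthetic population of 100 blocks with roughly the stated \(\mu = 33.5\) and \(\sigma = 20.5\) (an assumption made for illustration only):

```r
set.seed(2023)  # for reproducibility

# A synthetic population of 100 block weights (an assumption; the real
# Chocs and Blocks population is not listed in these notes)
population <- rnorm(100, mean = 33.5, sd = 20.5)

# 5000 simple random samples of n = 10 blocks, recording each sample mean
sample.means <- replicate(5000, mean(sample(population, size = 10)))

# The simulated sampling distribution of the sample mean
hist(sample.means, breaks = 30,
     xlab = "Sample mean weight (g)",
     main = "Simulated sampling distribution (n = 10)")

mean(sample.means)  # close to the population mean
sd(sample.means)    # much smaller than the population sd, and shrinks as n grows
```

Re-running with size = 25 or size = 50 reproduces the narrowing spread described above.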

Sampling distribution of \(\bar{x}\)

If the population mean, \(\mu\), and population standard deviation, \(\sigma\), are known—the ground “truths” (parameters) that summarise all possible values we could observe

The sampling distribution of the sample mean, \(\bar{x}\), is

\[ \bar{x} \overset{\text{approx.}}{\sim} \text{Normal} \! \left(\mu_{\bar{x}} = \mu, \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\right) \]

The use of the \(\bar{x}\) subscripts is to make it clear that we are talking about the sampling distribution of \(\bar{x}\) and not the possible values we could observe

Applying Slide 12 on Chocs and Blocks

The \(\mu = 33.5\) and \(\sigma = 20.5\) for the population of wood blocks

| Sample size | Estimates (from simulations) | Theoretical (as we know the ground “truths”) |
|---|---|---|
| \(n=10\) | \(\bar{x} = 33.47\) and \(s = 6.21\) | \(\mu_\bar{x} = 33.50\) and \(\sigma_\bar{x} = 6.48\) |
| \(n=25\) | \(\bar{x} = 33.44\) and \(s = 3.53\) | \(\mu_\bar{x} = 33.50\) and \(\sigma_\bar{x} = 4.10\) |
| \(n=50\) | \(\bar{x} = 33.55\) and \(s = 2.03\) | \(\mu_\bar{x} = 33.50\) and \(\sigma_\bar{x} = 2.90\) |
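The theoretical column follows directly from the formula \(\sigma_\bar{x} = \sigma / \sqrt{n}\) on the previous slide, and is quick to check in R:

```r
# Ground "truth" for the spread of the block population
sigma <- 20.5

# Theoretical standard deviation of the sampling distribution of the
# sample mean, sigma / sqrt(n), for each sample size considered
n <- c(10, 25, 50)
round(sigma / sqrt(n), 2)
# [1] 6.48 4.10 2.90
```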

Quantifying uncertainty of a statistic

T021: Population ⇝ Sample ⇢ Population revisited

A conundrum?…

We sample or experiment because we do not know the ground “truths” (parameters). Hence the idea of population ⇝ sample ⇢ population

If we only have one sample, the (descriptive) statistics calculated from this sample are our best estimate of the ground “truths”

Coupled with our understanding of sampling distributions of statistics when we take random samples

We can quantify the uncertainty of the statistics calculated from the one sample!

How accurate is the sample mean, \(\bar{x}\)?

  1. On Slide 11, it was shown that the sample mean, \(\bar{x}\), is relatively unbiased when we take random samples
  2. Additionally, the precision of \(\bar{x}\) improved as we increased the number of observations, \(n\)
  • That is, the standard deviation of the sampling distribution of \(\bar{x}\) decreases as \(n\) increases

Definition: \(\text{se}(\bar{x})\)

The standard error of the sample mean, \(\bar{x}\), is

\[ \text{se}(\bar{x}) = \frac{s}{\sqrt{n}} \]

where:

  • \(s\) is the sample standard deviation of the numeric variable
  • \(n\) is the number of observations

CS 2.1 revisited: Replication with light speeds

Recall that the theoretical passage time for Newcomb’s experiment was 24.8296 millionths of a second.

The sample mean, \(\bar{x}\), is:

\(\bar{x} = \ldots = 24.83 ~ (2 ~ \text{dp})\)

The standard error, \(\text{se}(\bar{x})\), is:

\(\text{se}(\bar{x}) =\displaystyle \frac{s}{\sqrt{n}}\)

\(\phantom{\text{se}(\bar{x})} = \displaystyle \frac{0.0051}{\sqrt{20}}\)

\(\phantom{\text{se}(\bar{x})} = 0.0011 ~ (4 ~ \text{dp})\)

# Reading in the dataset
lightspeed.df <- read.csv("datasets/lightspeed.csv")

# Calculate the sample mean
mean(lightspeed.df$pass.time)
[1] 24.82855
# Calculate the sample standard deviation
sd(lightspeed.df$pass.time)
[1] 0.005124503
# The number of observations
nrow(lightspeed.df)
[1] 20
# Use R as a calculator to calculate the standard error
sd(lightspeed.df$pass.time) / sqrt(nrow(lightspeed.df))
[1] 0.001145874

One tool down! And more to come

\(\text{se}(\bar{x})\) should look familiar—revisit Slide 12

The critical difference is that \(\text{se}(\bar{x})\) is quantified with a statistic, \(s\), whereas \(\sigma_\bar{x}\) is defined in terms of \(\sigma\), a ground “truth” about the spread of all possible values

Thankfully, we can handle this “minute” difference with a different tool, the Student’s t-distribution, another kind of bell-shaped distribution

In particular, the Student’s t-distribution allows us to quantify the precision of \(\bar{x}\), our best estimate of \(\mu\), for any sample size (with some assumptions)
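The t-multipliers behave exactly as described: they are wider than the Normal distribution’s 1.96 for small samples and approach it as the degrees of freedom grow. A quick check with R’s qt() and qnorm():

```r
# 97.5th percentile (the multiplier for a 95% interval) of the Student's
# t-distribution across increasing degrees of freedom
round(qt(0.975, df = c(4, 9, 19, 49, 99, 999)), 3)
# [1] 2.776 2.262 2.093 2.010 1.984 1.962

# The corresponding Normal multiplier that the t-multiplier approaches
round(qnorm(0.975), 3)
# [1] 1.96
```

The wider multiplier at small degrees of freedom is the price we pay for estimating \(\sigma\) with \(s\).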

A confidence interval for μ

Definition: Confidence intervals

A confidence interval for a parameter is an interval computed from sample data by a method that will capture the parameter for a specified proportion of all samples.

The success rate (proportion of all samples whose intervals contain the parameter) is known as the confidence level.

— Lock et al. (2021)

Assumptions for a confidence interval for μ

  1. Independent observations—typically met with random samples or randomisation of the data collection order with randomised experiments
  2. Unimodal—one peak
  3. Approximately symmetrical about the sample mean, \(\bar{x}\), and there are no outliers

More on 3.

In practice:

  • We must follow this assumption strictly for “small” datasets (\(n < 20\))
  • We can be lenient with this assumption for “medium” datasets if \(\text{se}(\bar{x})\) is not heavily affected (\(20 \leq n \leq 50\))
  • We can be very lenient with this assumption for “large” datasets (\(n > 50\))

Definition: \(100(1 - \alpha)\%\) confidence interval for μ

\[ \bar{x} \pm t^*_{1-\alpha/2}(\nu) \times \text{se}(\bar{x}) \]

where:

  • \(\bar{x}\) is the sample mean
  • \(n\) is the number of observations
  • The confidence level is \((1 - \alpha)\), where \(\alpha\) is a proportion
  • The degrees of freedom, \(\nu\)
    • For a \((1 - \alpha)\) C.I. for \(\mu\), we set this to \(\nu = n - 1\)
  • \(t^*_{1-\alpha/2}(\nu)\) is the t-multiplier for the prescribed confidence level of \((1 - \alpha)\)
    • For example, a confidence level of 95% results in \(t^*_{0.975}(\nu)\)
  • \(\text{se}(\bar{x})\) is the standard error of \(\bar{x}\)—see Slide 19

The R function we want to manually replicate

blocks.df <- read.csv("datasets/random-blocks.csv")

t.test(Weight ~ 1, data = blocks.df, conf.level = 0.95)

    One Sample t-test

data:  Weight
t = 5.1279, df = 9, p-value = 0.0006213
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 16.02792 41.33208
sample estimates:
mean of x 
    28.68 

Note

I will provide the relevant R output if you are asked to calculate any confidence interval in a test or exam

CS 1.3 revisited: Block weights

Sample data from a similar woodblock exercise used in the first lecture. The exercise aimed to estimate the average block weight using only a sample of blocks.

Variables
Block.ID An integer between 1–100 denoting the block’s identification number
Weight A number denoting the weight of the block (grams)

stripplot(~ Weight, data = blocks.df, xlab = "Weight (g)", main = "Weight of blocks")

Figure: The weight of the ten blocks

CS 1.3 revisited: Block weights

\[ \bar{x} \pm t^*_{1-\alpha/2}(\nu) \times \text{se}(\bar{x}) \]

# Sample mean
mean(blocks.df$Weight)
[1] 28.68
# Sample standard deviation
sd(blocks.df$Weight)
[1] 17.68639
# Number of observations
nrow(blocks.df)
[1] 10
# Square root of the number of observations
sqrt(nrow(blocks.df))
[1] 3.162278
# t-multiplier
qt(1 - 0.05 / 2, df = nrow(blocks.df) - 1)
[1] 2.262157

The standard error is \(\text{se}(\bar{x}) = 17.68639 / \sqrt{10} = 5.592928\), so the 95% confidence interval is

\[ 28.68 \pm 2.262157 \times 5.592928 = (16.02792, 41.33208) \]

On the interpretation of confidence intervals

Style One

We are 95% sure that the mean block weight for the 100 blocks is somewhere between 16.03 and 41.33 grams.

Style Two

With 95% confidence, we estimate that the mean block weight of the 100 blocks is somewhere between 16.03 and 41.33 grams.

Critical features

  • The confidence level
  • That it is an estimate of the ground “truth” (parameter) of …
  • The lower and upper bounds of the confidence interval are quantified with units (where applicable)

Does it work “95%” of the time?

Figure: 50 simple random samples and their 95% CI for the population mean

| Sample size | % Coverage (from simulations) |
|---|---|
| \(n=10\) | \(93.00\%\) |
| \(n=25\) | \(95.95\%\) |
| \(n=50\) | \(98.66\%\) |
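The coverage experiment can be sketched as follows. This version draws from a Normal population (an assumption made for illustration; the lecture’s own simulation used the block population), so the simulated coverage should sit close to the nominal 95%:

```r
set.seed(2023)  # for reproducibility

# A sketch of the coverage experiment: draw many samples, build a 95% CI
# from each, and count how often the interval captures mu. The Normal
# population parameters below are assumptions for illustration.
mu <- 33.5
sigma <- 20.5
n <- 25

captures <- replicate(5000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]
})

# Proportion of intervals that captured mu -- close to the nominal 0.95
mean(captures)
```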

CS 2.1 revisited: Replication with light speeds

Simon Newcomb experimented with a new method of measuring the speed of light in 1882, which involved using two different mirrors placed approximately 3721.865 metres apart. The following data comes from 20 repeated measurements of the passage time for light to travel from one mirror to another and back again.

The theoretical passage time for the above distance was 24.8296 millionths of a second. If this new method is unbiased and precise, the experimental data should agree with the theoretical passage time.

Variables
pass.time A number denoting the passage time for light to travel from one mirror to another and back again (millionths of a second, μs)

stripplot( ~ pass.time, data = lightspeed.df, jitter.data = TRUE,
          factor = 5, main = "Measurements of passage time for light from Newcomb's experiment",
          xlab = "Passage time (millionths of a second)")

Figure: The passage times for light to travel from one mirror to another and back again

CS 2.1 revisited: Replication with light speeds

# Calculate and assign several statistics to their own objects
xbar <- mean(lightspeed.df$pass.time)
n <- nrow(lightspeed.df)
se <- sd(lightspeed.df$pass.time) / sqrt(n)
t.mult <- qt(1 - 0.05 / 2, df = n - 1)

c(xbar, n, se, t.mult)
[1] 24.828550000 20.000000000  0.001145874  2.093024054
# One way of writing R code to manually calculate the 95% CI
xbar - se * t.mult
[1] 24.82615
xbar + se * t.mult
[1] 24.83095
# Another way of writing R code to manually calculate the 95% CI
xbar + c(-1, 1) * se * t.mult
[1] 24.82615 24.83095

Recall that the theoretical passage time for Newcomb’s experiment was 24.8296 millionths of a second.
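The manual steps above can be bundled into a small helper that works from summary statistics alone. The function name t.ci is made up for illustration (it is not part of base R):

```r
# A hypothetical helper (name invented here; not a base R function) that
# applies xbar +/- t* x se(xbar) given only summary statistics
t.ci <- function(xbar, s, n, conf.level = 0.95) {
  alpha <- 1 - conf.level
  t.mult <- qt(1 - alpha / 2, df = n - 1)
  xbar + c(-1, 1) * t.mult * s / sqrt(n)
}

# Reproducing the 95% CI for Newcomb's passage times
t.ci(xbar = 24.82855, s = 0.005124503, n = 20)
# [1] 24.82615 24.83095
```

The same call with \(\bar{x} = 28.68\), \(s = 17.68639\), and \(n = 10\) reproduces the block-weights interval from t.test().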

CS 3.1: The Avogadro constant

The Avogadro constant is a fundamental quantity in chemistry specifying the number of molecules in one mole of a substance. It can only be determined by experiment, for example, using an electrochemical cell. The accepted value is \(6.022 \times 10^{23}\).

A chemist conducts five repeats of an experiment to estimate Avogadro’s constant and obtains the following output

\[ \bar{x} = 5.78 \times 10^{23}, \quad s = 0.20 \times 10^{23}, \quad t^\ast_{0.975}(4) = 2.78, \quad t^\ast_{0.975}(5) = 2.57 \]

Construct and interpret a 95% confidence interval for \(\mu\). You may assume the assumptions to construct such an interval have been met.
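One way to check your working in R (note that the given \(t^\ast_{0.975}(4) = 2.78\) is rounded; qt() returns 2.776, so hand calculations with 2.78 will differ slightly in the later digits):

```r
# CS 3.1: five repeats of the experiment, so df = n - 1 = 4
xbar <- 5.78e23
s <- 0.20e23
n <- 5

ci <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)
ci / 1e23  # lower and upper bounds, in units of 10^23
# roughly 5.53 and 6.03
```

The accepted value, \(6.022 \times 10^{23}\), lies just inside the interval.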

What a confidence interval is not

  1. A 95% confidence interval contains 95% of the data in the population
  2. I am 95% sure that the mean of a sample will fall within a 95% confidence interval for the mean
  3. The probability (chance) that the population parameter is in this particular 95% confidence interval is 0.95